A/B Testing

Setting up a robust A/B Test
Categories: Stats, Hypothesis Testing, A/B Test
Author

Tim Anderson

Published

October 15, 2020


Years ago I was talking with an email marketer about his A/B testing. His strategy was to send two versions of an email to an equal number of recipients and watch which version got to 100 opens first. They had it set up like a horse race: their email tool would output the number of opens for version A and version B, the team could watch the progress for both, and whichever one got to 100 opens first was the winner.

As we discussed his method, I asked what happened if A was at 99 opens when B got to 100, and he just kind of looked at me, shrugged his shoulders, and quoted Dale Earnhardt with “Second place is just the first loser.”

Now, don’t get me wrong, it always comes down to context. If his race to 100 was being used to figure out which email to use for the next 5,000 addresses, it’s probably not a huge deal. If they were planning to peg a quarter-million-dollar marketing campaign to that result, then it’s a huge deal.
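To see why a race to 100 opens is such a shaky yardstick, here is a quick simulation sketch (my own illustration, not something from that marketer's toolkit). It assumes both versions have the exact same 10% open rate; the race still crowns a "winner" every time, and it's essentially a coin flip which one.

# Simulation sketch: both versions have an IDENTICAL 10% open rate,
# yet the "first to 100 opens" race always declares a winner.
set.seed(42)

race_to_target <- function(p_a = 0.10, p_b = 0.10, target = 100) {
  opens_a <- 0
  opens_b <- 0
  while (opens_a < target && opens_b < target) {
    opens_a <- opens_a + rbinom(1, 1, p_a)  # one more send for version A
    opens_b <- opens_b + rbinom(1, 1, p_b)  # one more send for version B
  }
  if (opens_a >= target) "A" else "B"
}

winners <- replicate(1000, race_to_target())
table(winners)  # roughly a 50/50 split -- the "win" is pure noise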

I’ve talked about this before, but when we’re making decisions based on data, we always confront one key question:

Is this difference real, or just random chance (noise)?

An A/B test is the gold-standard method for assessing whether a new variation (in this case, a new email subject line) truly outperforms the old variation. It randomly splits your audience so that each version is tested with a similar cross-section of subscribers. If one version gets a higher open rate, we can be more confident it’s due to the subject line itself and not just random chance.

In the post below, I will walk through in detail how I would go about building an A/B test for a hypothetical email marketing scenario.

Background:

  • We have a database of 25,000 potential customer email addresses. However, many have already seen our current first-email-in-the-nurture-stream.

  • Historically, that first email has a 10% open rate.

  • We have a new subject line we believe can push opens up to 15%.

In this post, we’ll:

  1. Plan an A/B test with a sample size that’s big enough to detect a 10% → 15% difference but doesn’t require emailing all 25,000.

  2. Set up the experiment in our marketing automation tool.

  3. Analyze the results using R’s prop.test().


1) Why an A/B Test?

An A/B test randomly splits recipients between two versions of an email (old vs. new subject line). If one version truly outperforms the other, a properly sized A/B test will reliably detect that difference rather than chalking it up to random fluctuation.

Key question we want to answer:

“Does our new subject line actually increase the open rate from 10% to 15%, or is any observed difference just noise?”


2) Estimating the Needed Sample Size

A) Defining Effect Size, Alpha, and Power

  1. Effect Size: We want to detect an improvement from a 10% baseline to a 15% target. That’s a 5-percentage-point gain.

  2. Significance Level (α): Typically 0.05 (5%), meaning we accept a 5% chance of a false positive.

  3. Power (1 – β): Typically 0.80 (80%), meaning if a 5-percentage-point difference truly exists, we have an 80% chance of detecting it as “significant.”

B) Using R’s pwr Package

In R, we can calculate the sample size per group with:

# install.packages("pwr") if needed 
library(pwr)

# Baseline (p1) = 0.10, Target (p2) = 0.15

p1 <- 0.10 
p2 <- 0.15

# Convert to Cohen's h

effect_size <- ES.h(p1, p2)

# Calculate needed sample size per group

pwr.2p.test(h = effect_size, sig.level = 0.05, power = 0.80, alternative = "two.sided")

     Difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.1518977
              n = 680.3527
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: same sample sizes

This suggests about 680 recipients per group for an 80% chance to detect a 10%→15% jump at the 5% significance level.
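If you want to see where that 680 comes from without the pwr package, the arcsine-based calculation behind pwr.2p.test can be reproduced (approximately, ignoring a negligible second-tail term) in a few lines. This is just a sanity-check sketch of the same math:

# Reproducing the pwr.2p.test result by hand (sanity check)
h       <- 2 * asin(sqrt(0.15)) - 2 * asin(sqrt(0.10))  # Cohen's h, ~0.152
z_alpha <- qnorm(1 - 0.05 / 2)   # ~1.96 for a two-sided 5% test
z_beta  <- qnorm(0.80)           # ~0.84 for 80% power

n_per_group <- 2 * ((z_alpha + z_beta) / h)^2
n_per_group  # ~680, matching the pwr output above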

C) Practical Decision

  • Given 680 is the minimum per group to achieve 80% power, you might round up or go bigger (e.g., 1,000 or 1,500 each) to protect against data imperfections or drop-offs.

  • Avoid emailing the full 25,000: Instead, select, say, 2,000 random contacts (1,000 for control, 1,000 for treatment). That should comfortably detect the difference if it’s really 5 points or more.
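As a quick check on that choice, we can ask pwr.2p.test what power we would get with 1,000 per group instead of 680. This sketch assumes the same 10% vs. 15% open rates as above:

# Power check: with 1,000 recipients per group, how likely are we to
# detect a true 10% -> 15% lift?
library(pwr)

pwr.2p.test(h = ES.h(0.15, 0.10), n = 1000, sig.level = 0.05,
            alternative = "two.sided")
# power comes out around 0.92, comfortably above the 0.80 target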


2.a) A Note on Effect Size and Sample Size

There’s an important trade-off to keep in mind: the smaller the improvement we want to detect, the larger our sample needs to be. For example, if we’re looking for a jump from a 10% open rate to a 12% open rate (just 2 points), we’ll need significantly more participants in each group than if we expect a 5-point jump (10% to 15%). That’s because small differences are harder to distinguish from random fluctuations; thus, we need more data for statistical confidence.

For example, if we run the power calculation again assuming a 10% baseline and a 12% target, the required sample size goes up sharply, as shown below:

# Baseline (p1) = 0.10, Target (p2) = 0.12

p1 <- 0.10 
p2 <- 0.12

# Convert to Cohen's h

effect_size <- ES.h(p1, p2)

# Calculate needed sample size per group

pwr.2p.test(h = effect_size, sig.level = 0.05, power = 0.80, alternative = "two.sided")

     Difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.0639821
              n = 3834.596
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: same sample sizes

Deciding Up Front

This is why it’s critical to decide up front what level of improvement really matters to the organization. If a 2% increase in opens is valuable enough to justify the extra time and expense of a bigger test, we should power the study accordingly. Conversely, if 2% is inconsequential, we might opt for a smaller test—knowing it may only reliably detect bigger gains. This ensures we optimize our A/B testing resources based on business impact, not just academic curiosity.
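To make that trade-off concrete, here is a small sketch (my addition, using the same pwr functions) that tabulates the required per-group sample size for a few target open rates against the 10% baseline:

# Required sample size per group for various target open rates,
# holding alpha = 0.05 and power = 0.80 (baseline open rate = 10%)
library(pwr)

baseline <- 0.10
targets  <- c(0.12, 0.13, 0.15, 0.20)

n_needed <- sapply(targets, function(p2) {
  ceiling(pwr.2p.test(h = ES.h(p2, baseline),
                      sig.level = 0.05, power = 0.80)$n)
})

data.frame(target_open_rate = targets, n_per_group = n_needed)
# e.g. ~3,835 per group for a 2-point lift vs. ~681 for a 5-point lift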

3) Experiment Setup

A) Create Two Groups in Our Marketing Automation Tool

  • Group A (Control): ∼1,000 recipients get the old subject line.

  • Group B (Treatment): ∼1,000 recipients get the new subject line.

I’m assuming that the marketing automation tool handles random assignment, ensuring each user is equally likely to end up in A or B.

Important: Filter out folks who’ve previously received this email or have an existing relationship that might bias them.
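If the tool didn't handle the filtering and random split for us, it's easy to sketch in R. The data below is a made-up stand-in for the real export, and the column names (email, received_first_email) are hypothetical:

# Hypothetical sketch: filter out prior recipients, then randomly assign
# 2,000 of the remaining contacts to control / treatment.
set.seed(123)

# Stand-in for the real export from the marketing automation tool
contacts <- data.frame(
  email                = paste0("person", 1:25000, "@example.com"),
  received_first_email = runif(25000) < 0.4   # pretend ~40% already saw it
)

eligible <- subset(contacts, !received_first_email)   # drop prior recipients
sampled  <- eligible[sample(nrow(eligible), 2000), ]  # pick 2,000 at random
sampled$group <- sample(rep(c("A_control", "B_treatment"), each = 1000))

table(sampled$group)  # 1,000 randomly assigned to each version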

B) Send and Track Opens

  1. Send both emails at the same time or in the same short window to avoid day/time biases.

  2. Over the next 24–72 hours (typical in email marketing), collect open data.

  3. Export the final counts:

    • Opens per group,

    • Total sends (or total recipients who actually received it) per group.


4) Analyzing Results: prop.test() in R

Suppose we find:

  • Group A (Control): 1,000 emails sent, 100 opens (10% open rate).

  • Group B (New Subject): 1,000 emails sent, 140 opens (14% open rate).

In R:

opens <- c(100, 140) 
totals <- c(1000, 1000)

test_result <- prop.test(opens, totals, alternative = "two.sided", conf.level = 0.95) 

Output Interpretation

You’ll see something like:

test_result

    2-sample test for equality of proportions with continuity correction

data:  opens out of totals
X-squared = 7.2017, df = 1, p-value = 0.007283
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.06942961 -0.01057039
sample estimates:
prop 1 prop 2 
  0.10   0.14 

  • p-value: If it’s < 0.05, we say “there’s a statistically significant difference.”

  • prop 1 vs. prop 2: 10% vs. 14% in this hypothetical scenario.

  • Confidence Interval around the difference: here, roughly -7% to -1% (proportion A minus proportion B). In other words, we’re 95% confident the new subject line’s true open rate is somewhere between about 1 and 7 percentage points higher than the old one’s.

Conclusion: A p-value of 0.007283 is quite small. There’s strong evidence that the new subject line outperforms the old, and we can be fairly confident the lift is more than just random chance.
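If you want to pull these numbers out programmatically (say, for a dashboard or an automated report), the prop.test() result is a standard htest list, so the pieces are easy to grab:

# Pulling the key numbers out of the prop.test() result
test_result$p.value    # 0.007283 -- below the 0.05 threshold
test_result$conf.int   # 95% CI for (prop A - prop B): about -0.069 to -0.011
test_result$estimate   # the two observed open rates: 0.10 and 0.14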


5) Wrapping Up: What’s Next?

  1. Decision: If the difference is both statistically and practically significant, we might adopt the new subject line for our broader campaign.

  2. Further Tests: If we want to refine the subject line more—or test other email elements (like the preview text)—we can iterate with another A/B test.

  3. Keep It Rigorous: Remember, we used fewer than 25,000 addresses to get a valid test. That’s efficient and spares us from over-emailing our entire list, much of which already saw the original email or doesn’t need to be included.


6) Key Takeaways

  • Plan for the effect size you care about: In this case, a 5-percentage-point lift (10%→15%).

  • Calculate the sample size needed to detect that difference, ensuring sufficient power (often 80%).

  • Run the test on that smaller subset, not the entire 25,000 database.

  • Analyze with a two-proportion z-test (prop.test() in R) to see if any observed difference is more than random noise.

  • Interpret and act: If the data show a real lift, roll out the winning subject line more broadly.

By setting up a properly powered test, we avoid over-emailing our entire list while still getting trustworthy results about whether the new subject line is truly an improvement.